dgit.raspbian.org Git

debian/rules: Set CC/LD to enable cross-building

Always set $(CC) and $(LD) when building. This has no effect on native
builds, but is needed for cross-builds to use the right compiler.

Signed-off-by: Elliott Mitchell <ehem+debian@m5p.com>
Acked-by: Hans van Kranenburg <hans@knorrie.org>

d/shuffle-boot-files: The Great Quotification

These should originate with the owner of a build system and are unlikely
to get hazardous values. This script though *should* work on a system
with such a bizzare setup. On general principle, add lots of double-quotes.

Signed-off-by: Elliott Mitchell <ehem+debian@m5p.com>
Acked-by: Hans van Kranenburg <hans@knorrie.org>

d/shuffle-binaries: Add quoting for potentially changeable variables

These should originate with the owner of a build system and are unlikely
to get hazardous values. This script though *should* work on a system
with such a bizarre setup. On general principle, add double-quotes.

Signed-off-by: Elliott Mitchell <ehem+debian@m5p.com>
Acked-by: Hans van Kranenburg <hans@knorrie.org>

d/shuffle-binaries: Make error detection/message overt

The reason for the `ls` at the end is pretty straightforward if you think
about it, mainly confirming the shuffle step did something. Yet that
is a bit non-obvious, and the error message produced on failure is poor
("No such file or directory" simply means the script failed somewhere?).
Add an overt error message.

Signed-off-by: Elliott Mitchell <ehem+debian@m5p.com>
Acked-by: Hans van Kranenburg <hans@knorrie.org>

xen/arm: traps: Don't panic when receiving an unknown debug trap

Even if debug trap are only meant for debugging purpose, it is quite
harsh to crash Xen if one of the trap sent by the guest is not handled.

So switch from a panic() to a printk().

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 957708c2d1ae25d7375abd5e5e70c3043d64f1f1)

xen/arm: acpi: add BAD_MADT_GICC_ENTRY() macro

Imported from Linux commit b6cfb277378ef831c0fa84bcff5049307294adc6:

    The BAD_MADT_ENTRY() macro is designed to work for all of the subtables
    of the MADT.  In the ACPI 5.1 version of the spec, the struct for the
    GICC subtable (struct acpi_madt_generic_interrupt) is 76 bytes long; in
    ACPI 6.0, the struct is 80 bytes long.  But, there is only one definition
    in ACPICA for this struct -- and that is the 6.0 version.  Hence, when
    BAD_MADT_ENTRY() compares the struct size to the length in the GICC
    subtable, it fails if 5.1 structs are in use, and there are systems in
    the wild that have them.

    This patch adds the BAD_MADT_GICC_ENTRY() that checks the GICC subtable
    only, accounting for the difference in specification versions that are
    possible.  The BAD_MADT_ENTRY() will continue to work as is for all other
    MADT subtables.

    This code is being added to an arm64 header file since that is currently
    the only architecture using the GICC subtable of the MADT.  As a GIC is
    specific to ARM, it is also unlikely the subtable will be used elsewhere.

Fixes: aeb823bbacc2 ("ACPICA: ACPI 6.0: Add changes for FADT table.")
Signed-off-by: Al Stone <al.stone@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    [catalin.marinas@arm.com: extra brackets around macro arguments]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit 7056f2f89f03f2f804ac7e776c7b2b000cd716cd)

xen/arm: Introduce fw_unreserved_regions() and use it

Since commit 6e3e77120378 "xen/arm: setup: Relocate the Device-Tree
later on in the boot", the device-tree will not be kept mapped when
using ACPI.

However, a few places are calling dt_unreserved_regions() which expects
a valid DT. This will lead to a crash.

As the DT should not be used for ACPI (other than for detecting the
modules), a new function fw_unreserved_regions() is introduced.

It will behave the same way on DT system. On ACPI system, it will
unreserve the whole region.

Take the opportunity to clarify that bootinfo.reserved_mem is only used
when booting using Device-Tree.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9c2bc0f24b2ba7082df408b3c33ec9a86bf20cf0)

xen/arm: Check if the platform is not using ACPI before initializing Dom0less

Dom0less requires a device-tree. However, since commit 6e3e77120378
"xen/arm: setup: Relocate the Device-Tree later on in the boot", the
device-tree will not get unflatten when using ACPI.

This will lead to a crash during boot.

Given the complexity to setup dom0less with ACPI (for instance how to
assign device?), we should skip any code related to Dom0less when using
ACPI.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit dac867bf9adc1562a4cf9db5f89726597af13ef8)

xen/arm: acpi: The fixmap area should always be cleared during failure/unmap

Commit 022387ee1ad3 "xen/arm: mm: Don't open-code Xen PT update in
{set, clear}_fixmap()" enforced that each set_fixmap() should be
paired with a clear_fixmap(). Any failure to follow the model would
result to a platform crash.

Unfortunately, the use of fixmap in the ACPI code was overlooked as it
is calling set_fixmap() but not clear_fixmap().

The function __acpi_os_map_table() is reworked so:
    - We know before the mapping whether the fixmap region is big
    enough for the mapping.
    - It will fail if the fixmap is already in use. This is not a
    change of behavior but clarifying the current expectation to avoid
    hitting a BUG().

The function __acpi_os_unmap_table() will now call clear_fixmap().

Reported-by: Wei Xu <xuwei5@hisilicon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 4d625ff3c3a939dc270b03654337568c30c5ab6e)

xen/acpi: Rework acpi_os_map_memory() and acpi_os_unmap_memory()

The functions acpi_os_{un,}map_memory() are meant to be arch-agnostic
while the __acpi_os_{un,}map_memory() are meant to be arch-specific.

Currently, the former are still containing x86 specific code.

To avoid this rather strange split, the generic helpers are reworked so
they are arch-agnostic. This requires the introduction of a new helper
__acpi_os_unmap_memory() that will undo any mapping done by
__acpi_os_map_memory().

Currently, the arch-helper for unmap is basically a no-op so it only
returns whether the mapping was arch specific. But this will change
in the future.

Note that the x86 version of acpi_os_map_memory() was already able to
able the 1MB region. Hence why there is no addition of new code.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Tested-by: Elliott Mitchell <ehem+xen@m5p.com>
(cherry picked from commit 1c4aa69ca1e1fad20b2158051eb152276d1eb973)

xen/arm: acpi: Don't fail if SPCR table is absent

Absence of a SPCR table likely means the console is a framebuffer. In
such case acpi_iomem_deny_access() should NOT fail.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 861f0c110976fa8879b7bf63d9478b6be83d4ab6)

tools/python: Pass linker to Python build process

Unexpectedly the environment variable which needs to be passed is
$LDSHARED and not $LD. Otherwise Python may find the build `ld` instead
of the host `ld`.

Replace $(LDFLAGS) with $(SHLIB_LDFLAGS) as Python needs shared objects
it can load at runtime, not executables.

This uses $(CC) instead of $(LD) since Python distutils appends $CFLAGS
to $LDFLAGS which breaks many linkers.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit 17d192e0238d6c714e9f04593b59597b7090be38)

[ Hans van Kranenburg ]
Fixed cherry-pick conflict because we have LIBEXEC_LIB=$(LIBEXEC_LIB) in
between in the same lines. The line wrap mess makes it a bit hard to
follow.

xen/rpi4: implement watchdog-based reset

The preferred method to reboot RPi4 is PSCI. If it is not available,
touching the watchdog is required to be able to reboot the board.

The implementation is based on
drivers/watchdog/bcm2835_wdt.c:__bcm2835_restart in Linux v5.9-rc7.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Tested-by: Roman Shaposhnik <roman@zededa.com>
CC: roman@zededa.com
(cherry picked from commit 25849c8b16f2a5b7fcd0a823e80a5f1b590291f9)

d/rules: do not compress /usr/share/doc/xen/html

So, we have a xen-doc package which is already split off nicely as
binary package to contain the upstream html collection of documentation.

By default, dh_compress will gzip -9 .txt files which it considers to
be too large. However, in our case, there's some index.html with links,
and if this happens, the user who consciously installs the xen-doc
package, wanting to waste disk space on this ends up with a confusing
pile of broken links when navigating to file:///usr/share/doc/xen/html/
and browsing around.

The difference between disk space occupied when not compressing is 5.2M
vs. 5.1M before.

So, add an exclude on /usr/share/doc/xen/html

Closes: #942611
Reported-by: Diederik de Haas <didi.debian@cknow.org>
Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

t/h/L/vif-common.sh: fix handle_iptable return value

A return statement without explicit value will return the value of the
last command executed before this line with return was encountered.

This is not what we want. return 0.

Closes: #955994
Fixes: 2e0814f971dd ("vif-common: disable handle_iptable")
Reported-by: Samuel Thibault <sthibault@debian.org>
Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

tools: Partially revert "Cross-compilation fixes."

This partially reverts commit 16504669c5cbb8b195d20412aadc838da5c428f7.

Doesn't look like much of 16504669c5cbb8b195d20412aadc838da5c428f7
actually remains due to passage of time.

Of the 3, both Python and pygrub appear to mostly be building just fine
cross-compiling. The OCAML portion is being troublesome, this is going
to cause bug reports elsewhere soon. The OCAML portion though can
already be disabled by setting OCAML_TOOLS=n and shouldn't have this
extra form of disabling.

Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 69953e2856382274749b617125cc98ce38198463)

tools: don't build/ship xenmon

It can't run with Python 3, and I'm not going to fix it.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

tools/xl/bash-completion: also complete 'xen'

We have the `xen` alias for xl in Debian, since in the past it was a
command that could execute either xl or xm.

Now, it always does xl, so, complete the same stuff for it as we have
for xl.

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
[git-debrebase split: mixed commit: upstream part]

pygrub: Specify -rpath LIBEXEC_LIB when building fsimage.so

If LIBEXEC_LIB is not on the default linker search path, the python
fsimage.so module fails to find libfsimage.so.

Add the relevant directory to the rpath explicitly.

(This situation occurs in the Debian package, where
--with-libexec-libdir is used to put each Xen version's libraries and
utilities in their own directory, to allow them to be coinstalled.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

pygrub: Set sys.path

We install libfsimage in a non-standard path for Reasons.
(See debian/rules.)

This patch was originally part of `tools-pygrub-prefix.diff'
(eg commit 51657319be54) and included changes to the Makefile to
change the installation arrangements (we do that part in the rules now
since that is a lot less prone to conflicts when we update) and to
shared library rpath (which is now done in a separate patch).

(Commit message rewritten by Ian Jackson.)

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
squash! pygrub: Set sys.path and rpath

hotplug-common: Do not adjust LD_LIBRARY_PATH

This is in the upstream script because on non-Debian systems, the
default install locations in /usr/local/lib might not be on the linker
path, and as a result the hotplug scripts would break.

A reason we might need it in Debian is our multiple version
coinstallation scheme. However, the hotplug scripts all call the
utilities via the wrappers, and the binaries are configured to load
from the right place anyway.

This setting is an annoyance because it requires libdir, which is an
arch-specific path but comes from a file we want to put in
xen-utils-common, an arch:all package.

So drop this setting.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

sysconfig.xencommons.in: Strip and debianize

Strip all options that are for stuff we don't ship, which is 1)
xenstored as stubdom and 2) xenbackendd, which seems to be dead code
anyway. [1]

It seems useful to give the user the option to revert to xenstored
instead of the default oxenstored if they really want.

[1] https://lists.xen.org/archives/html/xen-devel/2015-07/msg04427.html

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
Acked-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

vif-common: disable handle_iptable

Also see Debian bug #894013. The current attempt at providing
anti-spoofing rules results in a situation that does not have any
effect. Also note that forwarding bridged traffic to iptables is not
enabled by default, and that for openvswitch users it does not make any
sense.

So, stop cluttering the live iptables ruleset.

This functionality seems to be introduced before 2004 and since then it
has never got some additional love.

It would be nice to have a proper discussion upstream about how Xen
could provide some anti mac/ip spoofing in the dom0. It does not seem to
be a trivial thing to do, since it requires having quite some knowledge
about what the domU is allowed to do or not (e.g. a domU can be a
router...).

Fix empty fields in first hypervisor log line

Instead of:

    (XEN) Xen version 4.11.1 (Debian )
    (@)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 19:08:37 UTC 2019

I'd like to see:

    (XEN) Xen version 4.11.1 (Debian 4.11.1-1~)
    (pkg-xen-devel@lists.alioth.debian.org)
    (gcc (Debian 8.2.0-13) 8.2.0) debug=n
    Thu Jan  3 22:44:00 CET 2019

The substitution was broken since the great packaging refactoring,
because the directory in which the build is done changed.

Also, use the Maintainer address from debian/control instead of the most
recent changelog entry. If someone wants to use the address to ask a
question, they will end up at the team mailing list, which is better
than an individual person.

docs/man/xen-vbd-interface.7: Provide properly-formatted NAME section

This manpage was omitted from
docs/man: Provide properly-formatted NAME sections
because I was previously building with markdown not installed.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

shim: Provide separate install-shim target

When building on a 32-bit userland, the user wants to build 32-bit
tools and a 64-bit hypervisor.  This involves setting XEN_TARGET_ARCH
to different values for the tools build and the hypervisor build.

So the user must invoke the tools build and the hypervisor build
separately.

However, although the shim is done by the tools/firmware Makefile, its
bitness needs to be the same as the hypervisor, not the same as the
tools.  When run with XEN_TARGET_ARCH=x86_32, it it skipped, which is
wrong.

So the user must invoke the shim build separately.  This can be done
with
   make -C tools/firmware/xen-dir XEN_TARGET_ARCH=x86_64

However, tools/firmware/xen-dir has no `install' target.  The
installation of all `firmware' is done in tools/firmware/Makefile.  It
might be possible to fix this, but it is not trivial.  For example,
the definitions of INST_DIR and DEBG_DIR would need to be copied, as
would an appropriate $(INSTALL_DIR) call.

For now, provide an `install-shim' target in tools/firmware/Makefile.

This has to be called from `install' of course.  We can't make it
a dependency of `install' because it might be run before `all' has
completed.  We could make it depend on a `shim' target but such
a target is nearly impossible to write because everything is done by
the inflexible subdir-$@ machinery.

The overally result of this patch is that existing make invocations
work as before.  But additionally, the user can say
  make -C tools/firmware install-shim XEN_TARGET_ARCH=x86_64
to install the shim.  The user must have built it already.
Unlike the build rune, this install-rune is properly conditional
so it is OK to call on ARM.

What a mess.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makefile: CONFIG_PV_SHIM: enable only on x86_64

Previously this was *dis*abled for x86_*32*. But if someone should
run some of this Makefile on ARM, say, it ought not to be built
either.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

tools/firmware/Makfile: Respect caller's CONFIG_PV_SHIM

This makes it easier to disable the shim build. (In Debian we need to
build the shim separately because it needs different compiler flags
and a different XEN_COMPILE_ARCH.

Signed-off-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

Revert "pvshim: make PV shim build selectable from configure"

This reverts commit 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2.

This upstream change changes stuff that breaks our very fragile mess
that builds the shim when it needs to, and doesn't when it should not.

The result is that it's missing in the end for the i386 build... :|

    dh_install: warning: Cannot find (any matches for)
    "usr/lib/debug/usr/lib/xen-*/boot/*" (tried in ., debian/tmp)

    dh_install: warning: xen-utils-4.14 missing files:
    usr/lib/debug/usr/lib/xen-*/boot/*
    dh_install: error: missing files, aborting

.gitignore: Add configure output which we always delete and regenerate

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

autoconf: Provide libexec_libdir_suffix

This is going to be used to put libfsimage.so into a path containing
the multiarch triplet.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools-libfsimage-prefix.diff

\o/

Do not build the instruction emulator

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

tools/tests/x86_emulator: Pass -no-pie -fno-pic to gcc on x86_32

The current build fails with GCC6 on Debian sid i386 (unstable):

/tmp/ccqjaueF.s: Assembler messages:
/tmp/ccqjaueF.s:3713: Error: missing or invalid displacement expression `vmovd_to_reg_len@GOT'

This is due to the combination of GCC6, and Debian's decision to
enable some hardening flags by default (to try to make runtime
addresses less predictable):
  https://wiki.debian.org/Hardening/PIEByDefaultTransition

This is of no benefit for the x86 instruction emulator test, which is
a rebuild of the emulator code for testing purposes only.  So pass
options to disable this.

These options will be no-ops if they are the same as the compiler
default.

On amd64, the -fno-pic breaks the build in a different way.  So do
this only on i386.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Andrew Cooper <andrew.cooper3@citrix.com>
Gbp-Pq: Topic misc
Gbp-Pq: Name toolstestsx86_emulator-pass--no-pie--fno.patch

Remove static solaris support from pygrub

Patch-Name: tools-pygrub-remove-static-solaris-support

Gbp-Pq: Topic misc
Gbp-Pq: Name tools-pygrub-remove-static-solaris-support

Do not ship COPYING into /usr/include

This is not wanted in Debian. COPYING ends up in
/usr/share/doc/xen-*copyright.

Patch-Name: tools-include-no-COPYING.diff

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

config-prefix.diff

Patch-Name: config-prefix.diff

Gbp-Pq: Topic prefix-abiname
Gbp-Pq: Name config-prefix.diff

version

Delete configure output

These autogenerated files are not useful in Debian; dh_autoreconf will
regenerate them.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Delete config.sub and config.guess

dh_autoreconf will provide these back.

If this patch does not apply when rebasing, you can simply delete the
files again.

Signed-off-by: Ian Jackson <ian.jackson@citrix.com>

Update changelog for new upstream 4.14.0+88-g1d1d1f5391

[git-debrebase changelog: new upstream 4.14.0+88-g1d1d1f5391]

Update to upstream 4.14.0+88-g1d1d1f5391

[git-debrebase anchor: new upstream 4.14.0+88-g1d1d1f5391, merge]

finalise changelog for unstable

Signed-off-by: Ian Jackson <iwj@xenproject.org>

changelog: document ~exp2

Signed-off-by: Ian Jackson <iwj@xenproject.org>

x86/vioapic: fix usage of index in place of GSI in vioapic_write_redirent

The usage of idx instead of the GSI in vioapic_write_redirent when
accessing gsi_assert_count can cause a PVH dom0 with multiple
vIO-APICs to lose interrupts in case a pin of a IO-APIC different than
the first one is unmasked with pending interrupts.

Switch to use gsi instead to fix the issue.

Fixes: 9f44b08f7d0e4 ('x86/vioapic: introduce support for multiple vIO APICS')
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Roger Pau Monné <roge.rpau@citrix.com>
Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3ae469af8e680df31eecd0a2ac6a83b58ad7ce53
master date: 2020-11-30 14:06:38 +0100

xen/events: rework fifo queue locking

Two cpus entering evtchn_fifo_set_pending() for the same event channel
can race in case the first one gets interrupted after setting
EVTCHN_FIFO_PENDING and when the other one manages to set
EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
lead to evtchn_check_pollers() being called before the event is put
properly into the queue, resulting eventually in the guest not seeing
the event pending and thus blocking forever afterwards.

Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
lock") made the race just more obvious, while the fifo event channel
implementation had this race forever since the introduction and use of
per-channel locks, when an unmask operation was running in parallel with
an event channel send operation.

Using a spinlock for the per event channel lock had turned out
problematic due to some paths needing to take the lock are called with
interrupts off, so the lock would need to disable interrupts, which in
turn broke some use cases related to vm events.

For avoiding this race the queue locking in evtchn_fifo_set_pending()
needs to be reworked to cover the test of EVTCHN_FIFO_PENDING,
EVTCHN_FIFO_MASKED and EVTCHN_FIFO_LINKED, too. Additionally when an
event channel needs to change queues both queues need to be locked
initially, in order to avoid having a window with no lock held at all.

Reported-by: Jan Beulich <jbeulich@suse.com>
Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock")
Fixes: de6acb78bf0e137c ("evtchn: use a per-event channel lock for sending events")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71ac522909e9302350a88bc378be99affa87067c
master date: 2020-11-30 14:05:39 +0100

x86/DMI: fix SMBIOS pointer range check

Forever since its introduction this has been using an inverted relation
operator.

Fixes: 54057a28f22b ("x86: support SMBIOS v3")
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6befe598706218673b14710d90d00ce90763b372
master date: 2020-11-24 11:25:29 +0100

xen/events: access last_priority and last_vcpu_id together

The queue for a fifo event is depending on the vcpu_id and the
priority of the event. When sending an event it might happen the
event needs to change queues and the old queue needs to be kept for
keeping the links between queue elements intact. For this purpose
the event channel contains last_priority and last_vcpu_id values
elements for being able to identify the old queue.

In order to avoid races always access last_priority and last_vcpu_id
with a single atomic operation avoiding any inconsistencies.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 1277cb9dc5e966f1faf665bcded02b7533e38078
master date: 2020-11-24 11:23:42 +0100

x86/vpt: fix build with old gcc

I believe it was the XSA-336 fix (42fcdd42328f "x86/vpt: fix race when
migrating timers between vCPUs") which has unmasked a bogus
uninitialized variable warning. This is observable with gcc 4.3.4, but
only on 4.13 and older; it's hidden on newer versions apparently due to
the addition to _read_unlock() done by 12509bbeb9e3 ("rwlocks: call
preempt_disable() when taking a rwlock").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f2c620aa062767b318267d678ae249dcb637b870
master date: 2020-11-18 12:38:01 +0100

xen/evtchn: revert 52e1fc47abc3a0123

With the event channel lock no longer disabling interrupts commit
52e1fc47abc3a0123 ("evtchn/Flask: pre-allocate node on send path") can
be reverted again.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b5ad37f8e9284cc147218f7a5193d739ae7b956f
master date: 2020-11-10 14:37:15 +0100

xen/evtchn: rework per event channel lock

Currently the lock for a single event channel needs to be taken with
interrupts off, which causes deadlocks in some cases.

Rework the per event channel lock to be non-blocking for the case of
sending an event and removing the need for disabling interrupts for
taking the lock.

The lock is needed for avoiding races between event channel state
changes (creation, closing, binding) against normal operations (set
pending, [un]masking, priority changes).

Use a rwlock, but with some restrictions:

- Changing the state of an event channel (creation, closing, binding)
  needs to use write_lock(), with ASSERT()ing that the lock is taken as
  writer only when the state of the event channel is either before or
  after the locked region appropriate (either free or unbound).

- Sending an event needs to use read_trylock() mostly, in case of not
  obtaining the lock the operation is omitted. This is needed as
  sending an event can happen with interrupts off (at least in some
  cases).

- Dumping the event channel state for debug purposes is using
  read_trylock(), too, in order to avoid blocking in case the lock is
  taken as writer for a long time.

- All other cases can use read_lock().

Fixes: e045199c7c9c54 ("evtchn: address races with evtchn_reset()")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
xen/events: fix build

Commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock")
introduced a build failure for NDEBUG builds.

Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 5f2df45ead7c1195142f68b7923047a1e9479d54
master date: 2020-11-10 14:36:15 +0100
master commit: 53bacb86f496fdb11560d9e3b361bca7de60d268
master date: 2020-11-11 08:56:21 +0100

memory: fix off-by-one in XSA-346 change

The comparison against ARRAY_SIZE() needs to be >= in order to avoid
overrunning the pages[] array.

This is XSA-355.

Fixes: 5777a3742d88 ("IOMMU: hold page ref until after deferred TLB flush")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 9b156bcc3ffcc7949edd4460b718a241e87ae302
master date: 2020-11-24 14:01:31 +0100

debian/changelog: finalise 4.14.0+80-gd101b417b7-1~exp1

debian/changelog: fix typo

Partially revert "debian/rules: Combine shared Make args"

This reverts part of commit af30268ed73397d24acaee83477ff555a43df5e4.

The change caused the following FTBFS on i386:

    gcc    -Wl,-z,relro -Wl,-z,now -pthread -Wl,-soname
    -Wl,libxentoolcore.so.1 -shared -Wl,--version-script=libxentoolcore.map
    -o libxentoolcore.so.1.0 handlereg.opic
    /usr/bin/ld: i386:x86-64 architecture of input file `handlereg.opic' is
    incompatible with i386 output
    /usr/bin/ld: handlereg.opic: file class ELFCLASS64 incompatible with
    ELFCLASS32
    /usr/bin/ld: final link failed: file in wrong format
    collect2: error: ld returned 1 exit status

"For $(make_args_xen), ${XEN_TARGET_ARCH} gets $(xen_arch_$(flavour)),
but for $(make_args_tools), ${XEN_TARGET_ARCH} gets
$(xen_arch_$(DEB_HOST_ARCH))."

Update changelog for new upstream 4.14.0+80-gd101b417b7

[git-debrebase changelog: new upstream 4.14.0+80-gd101b417b7]

Update to upstream 4.14.0+80-gd101b417b7

[git-debrebase anchor: new upstream 4.14.0+80-gd101b417b7, merge]

WIP debian/changelog: start 4.14.0-1~exp2

Signed-off-by: Hans van Kranenburg <hans@knorrie.org>

d/xen-utils-common.xen.init: disable oom killer for xenstored

In case of oom killer terminating some process, we'd rather not see
xenstored go. Xenstored has an in-memory database, and when starting the
process again, it would be empty, which is very inconvenient. Xenstored
should already score quite low and have a fairly low memory footprint,
but according to the user report, it happened.

Closes: #961511
Suggested-by: Samuel Thibault <sthibault@debian.org>
Signed-off-by: Hans van Kranenburg <hans@knorrie.org>
Acked-by: Ian Jackson <ijackson@chiark.greenend.org.uk>

x86/msr: Disallow guest access to the RAPL MSRs

Researchers have demonstrated using the RAPL interface to perform a
differential power analysis attack to recover AES keys used by other cores in
the system.

Furthermore, even privileged guests cannot use this interface correctly, due
to MSR scope and vcpu scheduling issues. The interface would want to be
paravirtualised to be used sensibly.

Disallow access to the RAPL MSRs completely, as well as other MSRs which
potentially access fine grain power information.

This is part of XSA-351.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL}

Currently a PV hardware domain can also be given control over the CPU
frequency, and such guest is allowed to write to MSR_IA32_PERF_CTL.
However since commit 322ec7c89f6 the default behavior has been changed
to reject accesses to not explicitly handled MSRs, preventing PV
guests that manage CPU frequency from reading
MSR_IA32_PERF_{STATUS/CTL}.

Additionally some HVM guests (Windows at least) will attempt to read
MSR_IA32_PERF_CTL and will panic if given back a #GP fault:

vmx.c:3035:d8v0 RDMSR 0x00000199 unimplemented
d8v0 VIRIDIAN CRASH: 3b c0000096 fffff806871c1651 ffffda0253683720 0

Move the handling of MSR_IA32_PERF_{STATUS/CTL} to the common MSR
handling shared between HVM and PV guests, and add an explicit case
for reads to MSR_IA32_PERF_{STATUS/CTL}.

Restore previous behavior and allow PV guests with the required
permissions to read the contents of the mentioned MSRs. Non privileged
guests will get 0 when trying to read those registers, as writes to
MSR_IA32_PERF_CTL by such guest will already be silently dropped.

Fixes: 322ec7c89f6 ('x86/pv: disallow access to unknown MSRs')
Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3059178798a23ba870ff86ff54d442a07e6651fc)

xen/arm: Always trap AMU system registers

The Activity Monitors Unit (AMU) has been introduced by ARMv8.4. It is
considered to be unsafe to be expose to guests as they might expose
information about code executed by other guests or the host.

Arm provided a way to trap all the AMU system registers by setting
CPTR_EL2.TAM to 1.

Unfortunately, on older revision of the specification, the bit 30 (now
CPTR_EL1.TAM) was RES0. Because of that, Xen is setting it to 0 and
therefore the system registers would be exposed to the guest when it is
run on processors with AMU.

As the bit is mark as UNKNOWN at boot in Armv8.4, the only safe solution
for us is to always set CPTR_EL1.TAM to 1.

Guest trying to access the AMU system registers will now receive an
undefined instruction. Unfortunately, this means that even well-behaved
guest may fail to boot because we don't sanitize the ID registers.

This is a known issues with other Armv8.0+ features (e.g. SVE, Pointer
Auth). This will taken care separately.

This is part of XSA-351 (or XSA-93 re-born).

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit 628e1becb6fb121475a6ce68e3f1cb4499851255)

tools/libs/stat: use memcpy instead of strncpy in getBridge

Use memcpy in getBridge to prevent gcc warnings about truncated
strings. We know that we might truncate it, so the gcc warning
here is wrong.
Revert previous change changing buffer sizes as bigger buffers
are not needed.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 40fe714ca4245a9716264fcb3ee8df42c3650cf6)

tool/libs/light: Fix libxenlight gcc warning

Fix gcc10 compilation warning about uninitialized variable by setting
it to 0.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0241809bf838875615797f52af34222e5ab8e98f)

tools/libxc: report malloc errors in writev_exact

The caller of writev_exact should be notified about malloc errors
when dealing with partial writes.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0d8d289af7a679c028462c4ed5d98586f9ef9648)

tools/libs/stat: fix broken build

Making getBridge() static triggered a build error with some gcc versions:

error: 'strncpy' output may be truncated copying 15 bytes from a string of
length 255 [-Werror=stringop-truncation]

Fix that by using a buffer with 256 bytes instead.

Fixes: 6d0ec053907794 ("tools: split libxenstat into new tools/libs/stat directory")
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit c8099e48c3dbb8c7399551a265756ecf354eece2)

tools/xenstore: Do not abort xenstore-ls if a node disappears while iterating

The do_ls() function has somewhat inconsistent handling of errors.

If reading the node's contents with xs_read() fails, then do_ls() will
just quietly not display the contents.

If reading the node's permissions with xs_get_permissions() fails, then
do_ls() will print a warning, continue, and ultimately won't exit with
an error code (unless another error happens).

If recursing into the node with xs_directory() fails, then do_ls() will
abort immediately, not printing any further nodes.

For persistent failure modes — such as ENOENT because a node has been
removed, or EACCES because it has had its permisions changed since the
xs_directory() on the parent directory returned its name — it's
obviously quite likely that if either of the first two errors occur for
a given node, then so will the third and thus xenstore-ls will abort.

The ENOENT one is actually a fairly common case, and has caused tools to
fail to clean up a network device because it *apparently* already
doesn't exist in xenstore.

There is a school of thought that says, "Well, xenstore-ls returned an
error. So the tools should not trust its output."

The natural corollary of this would surely be that the tools must re-run
xenstore-ls as many times as is necessary until its manages to exit
without hitting the race condition. I am not keen on that conclusion.

For the specific case of ENOENT it seems reasonable to declare that,
but for the timing, we might as well just not have seen that node at
all when calling xs_directory() for the parent. By ignoring the error,
we give acceptable output.

The issue can be reproduced as follows:

(dom0) # for a in `seq 1 1000` ; do
              xenstore-write /local/domain/2/foo/$a $a ;
         done

Now simultaneously:

(dom0) # for a in `seq 1 999` ; do
              xenstore-rm /local/domain/2/foo/$a ;
         done
(dom2) # while true ; do
              ./xenstore-ls -p /local/domain/2/foo | grep -c 1000 ;
         done

We should expect to see node 1000 in the output, every time.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit beb105af19cc3e58e2ed49224cfe190a736e5fa0)

tools/xenpmd: Fix gcc10 snprintf warning

Add a check for snprintf return code and ignore the entry if we get an
error. This should in fact never happen and is more a trick to make gcc
happy and prevent compilation errors.

This is solving the following gcc warning when compiling for arm32 host
platforms with optimization activated:
xenpmd.c:92:37: error: '%s' directive output may be truncated writing
between 4 and 2147483645 bytes into a region of size 271
[-Werror=format-truncation=]

This is also solving the following Debian bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970802

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0dfddb2116e3757f77a691a3fe335173088d69dc)

libxl: fix -Werror=stringop-truncation in libxl__prepare_sockaddr_un

In file included from /usr/include/string.h:495,
                 from libxl_internal.h:38,
                 from libxl_utils.c:20:
In function 'strncpy',
    inlined from 'libxl__prepare_sockaddr_un' at libxl_utils.c:1262:5:
/usr/include/bits/string_fortified.h:106:10: error: '__builtin_strncpy' specified bound 108 equals destination size [-Werror=stringop-truncation]
  106 |   return __builtin___strncpy_chk (__dest, __src, __len, __bos (__dest));
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit fff1b7f50e75ad9535c86f3fcf425b4945c50a1c)

libxl: workaround gcc 10.2 maybe-uninitialized warning

It seems xlu_pci_parse_bdf has a state machine that is too complex for
gcc to understand. The build fails with:

    libxlu_pci.c: In function 'xlu_pci_parse_bdf':
    libxlu_pci.c:32:18: error: 'func' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       32 |     pcidev->func = func;
          |     ~~~~~~~~~~~~~^~~~~~
    libxlu_pci.c:51:29: note: 'func' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |                             ^~~~
    libxlu_pci.c:31:17: error: 'dev' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       31 |     pcidev->dev = dev;
          |     ~~~~~~~~~~~~^~~~~
    libxlu_pci.c:51:24: note: 'dev' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |                        ^~~
    libxlu_pci.c:30:17: error: 'bus' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       30 |     pcidev->bus = bus;
          |     ~~~~~~~~~~~~^~~~~
    libxlu_pci.c:51:19: note: 'bus' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |                   ^~~
    libxlu_pci.c:29:20: error: 'dom' may be used uninitialized in this function [-Werror=maybe-uninitialized]
       29 |     pcidev->domain = domain;
          |     ~~~~~~~~~~~~~~~^~~~~~~~
    libxlu_pci.c:51:14: note: 'dom' was declared here
       51 |     unsigned dom, bus, dev, func, vslot = 0;
          |              ^~~
    cc1: all warnings being treated as errors

Workaround it by setting the initial value to invalid value (0xffffffff)
and then assert on each value being set. This way we mute the gcc
warning, while still detecting bugs in the parse code.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit d25cc3ec93ebda030349045d2c7fa14ffde07ed7)

SUPPORT: Add linux device model stubdom to Toolstack

Add qemu-xen linux device model stubdomain to the Toolstack section as a
Tech Preview.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 4ddd6499d999a7d08cabfda5b0262e473dd5beed
master date: 2020-10-23 17:14:08 +0100

arm,smmu: match start level of page table walk with P2M

Don't hardcode the lookup start level of the page table walk to 1
and instead match the one used in P2M. This should fix scenarios
involving SMMU where the start level is different than 1.
In order for the SMMU driver to also compile on arm32 move the
P2M_ROOT_LEVEL in the p2m header file (while at it, for
consistency also P2M_ROOT_ORDER) and use the macro in the smmu
driver.

Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9ae1197582798b394d696cff94c4d742319bdbbf)

xen/arm: sched: Ensure the vCPU context is seen before vcpu_pause() returns

Some callers of vcpu_pause() will expect to access the latest vcpu
context when the function returns (see XENDOMCTL_{set,get}vcpucontext}.

However, the latest vCPU context can only be observed after
v->is_running has been observed to be false.

As there is no memory barrier instruction generated, a processor could
try to speculatively access the vCPU context before it was observed.

To prevent the corruption of the vCPU context, we need to insert a
memory barrier instruction after v->is_running is observed and before
the context is accessed. This barrier is added in sync_vcpu_execstate()
as it seems to be the place where we expect the synchronization to
happen.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit f6790389613cd54775ece6575013a679572b46b3)

xen/arm: bootfdt: Ignore empty memory bank

At the moment, Xen will stop processing the Device Tree if a memory
bank is empty (size == 0).

Unfortunately, some of the Device Tree (such as on Colibri imx8qxp)
may contain such a bank. This means Xen will not be able to boot
properly.

Relax the check to just ignore the banks. FWIW this also seems to be the
behavior adopted by Linux.

Reported-by: Daniel Wagner <Daniel.Wagner2@itk-engineering.de>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 5a37207df52066efefe419c677b089a654d37afc)

xen/arm64: force gcc 10+ to always inline generic atomics helpers

Recent versions of gcc (at least 10.x) will not inline generic atomics
helpers by default. Instead they will expect the software to either link
with libatomic.so or implement the helpers, which would result in

undefined reference to `__aarch64_ldadd4_acq_rel'

for us (not having any local implementation).

To keep the previous behavior, force gcc to always inline the generic
atomics helpers.

Long term we probably want to avoid relying on gcc atomics helpers as
this doesn't allow us to switch between LSE and LL/SC atomics.

Suggested-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 5d45ecabe3c0b2097df623ab7b471f8915cfdde6)

x86emul: fix PINSRW and adjust other {,V}PINSR*

The use of simd_packed_int together with no further update to op_bytes
has lead to wrong signaling of #GP(0) for PINSRW without a 16-byte
aligned memory operand. Use simd_none instead and override it after
general decoding with simd_other, like is done for the B/D/Q siblings.

While benign, for consistency also use DstImplicit instead of DstReg
in x86_decode_twobyte().

PINSR{B,D,Q} also had a stray (redundant) get_fpu() invocation, which
gets dropped.

For further consistency also
- use src.bytes instead of op_bytes in relevant memcpy() invocations,
- avoid the pointless updating of op_bytes (all we care about later is
that the value be less than 16).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 06f0598b41f23c9e4cf7d8c5a05b282de92f3a35
master date: 2020-10-23 18:03:18 +0200

pci: cleanup MSI interrupts before removing device from IOMMU

Doing the MSI cleanup after removing the device from the IOMMU leads
to the following panic on AMD hardware:

Assertion 'table.ptr && (index < intremap_table_entries(table.ptr, iommu))' failed at iommu_intr.c:172
----[ Xen-4.13.1-10.0.3-d  x86_64  debug=y   Not tainted ]----
CPU:    3
RIP:    e008:[<ffff82d08026ae3c>] drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
[...]
Xen call trace:
   [<ffff82d08026ae3c>] R drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
   [<ffff82d08026af25>] F drivers/passthrough/amd/iommu_intr.c#update_intremap_entry_from_msi_msg+0xc0/0x342
   [<ffff82d08026ba65>] F amd_iommu_msi_msg_update_ire+0x98/0x129
   [<ffff82d08025dd36>] F iommu_update_ire_from_msi+0x1e/0x21
   [<ffff82d080286862>] F msi_free_irq+0x55/0x1a0
   [<ffff82d080286f25>] F pci_cleanup_msi+0x8c/0xb0
   [<ffff82d08025cf52>] F pci_remove_device+0x1af/0x2da
   [<ffff82d0802a42d1>] F do_physdev_op+0xd18/0x1187
   [<ffff82d080383925>] F pv_hypercall+0x1f5/0x567
   [<ffff82d08038a432>] F lstar_enter+0x112/0x120

That's because the call to iommu_remove_device on AMD hardware will
remove the per-device interrupt remapping table, and hence the call to
pci_cleanup_msi done afterwards will find a null intremap table and
crash.

Reorder the calls so that MSI interrupts are torn down before removing
the device from the IOMMU.

Fixes: d7cfeb7c13ed ("AMD/IOMMU: don't blindly allocate interrupt remapping tables")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 710f62cc826bb8c7ead99f9d6b6b269e39ff3e98
master date: 2020-10-23 10:13:14 +0200

build: use if_changed more consistently (and correctly) for prelink*.o

Switch to $(call if_changed,ld) where possible; presumably not doing so
in e321576f4047 ("xen/build: start using if_changed") right away was an
oversight, as it did for Arm in (just) one case. It failed to add
prelink.o to $(targets), though, causing - judging from the observed
behavior on x86 - undue rebuilds of the final binary (because of
prelink.o getting rebuild for $(cmd_prelink.o) being empty, in turn
because of .prelink.o.cmd not getting read) during "make install-xen".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: dd2cfba88c3d0e144ffec07c6b5b86e54a9d98a9
master date: 2020-09-22 10:19:38 +0200

SUPPORT.md: Desupport qemu trad except stub dm

While investigating XSA-335 we discovered that many upstream security
fixes were missing. It is not practical to backport them. There is
no good reason to be running this very ancient version of qemu, except
that it is the only way to run a stub dm which is currently supported
by upstream.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 8587160b3e2951b722d395a0346bb17c3c22152f
master date: 2020-11-04 09:22:37 +0100

x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)

x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)

hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region

ACPI specification contains statements describing memory marked with regular
"ACPI data" type as reclaimable by the guest. Although the guest shouldn't
really do it if it wants kexec or similar functionality to work, there
could still be ambiguities in treating these regions as potentially regular
RAM.

One such example is SeaBIOS which currently reports "ACPI data" regions as
RAM to the guest in its e801 call. Which it might have the right to do as any
user of this is expected to be ACPI unaware. But a QEMU bootloader later seems
to ignore that fact and is instead using e801 to find a place for initrd which
causes the tables to be erased. While arguably QEMU bootloader or SeaBIOS need
to be fixed / improved here, that is just one example of the potential problems
from using a reclaimable memory type.

Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
described by the spec as non-reclaimable (so cannot ever be treated like RAM).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: de6d188a519f9e3b7a1acc7784adf4c243865f9a
master date: 2020-10-20 08:54:23 +0200

x86/smpboot: Don't unconditionally call memguard_guard_stack() in cpu_smpboot_alloc()

cpu_smpboot_alloc() is designed to be idempotent with respect to partially
initialised state. This occurs for S3 and CPU parking, where enough state to
handle NMIs/#MCs needs to remain valid for the entire lifetime of Xen, even
when we otherwise want to offline the CPU.

For simplicity between various configuration, Xen always uses shadow stack
mappings (Read-only + Dirty) for the guard page, irrespective of whether
CET-SS is enabled.

Unfortunately, the CET-SS changes in memguard_guard_stack() broke idempotency
by first writing out the supervisor shadow stack tokens with plain writes,
then changing the mapping to being read-only.

This ordering is strictly necessary to configure the BSP, which cannot have
the supervisor tokens be written with WRSS.

Instead of calling memguard_guard_stack() unconditionally, call it only when
actually allocating a new stack. Xenheap allocates are guaranteed to be
writeable, and the net result is idempotency WRT configuring stack_base[].

Fixes: 91d26ed304f ("x86/shstk: Create shadow stacks")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a7952a320c1e202a218702bfdb14f75132f04894
master date: 2020-10-16 11:56:59 +0100

x86/traps: 'Fix' safety of read_registers() in #DF path

All interrupts and exceptions pass a struct cpu_user_regs up into C.  This
contains the legacy vm86 fields from 32bit days, which are beyond the
hardware-pushed frame.

Accessing these fields is generally illegal, as they are logically out of
bounds for anything other than an interrupt/exception hitting ring1/3 code.

show_registers() unconditionally reads these fields, but the content is
discarded before use.  This is benign right now, as all parts of the stack are
readable, including the guard pages.

However, read_registers() in the #DF handler writes to these fields as part of
preparing the state dump, and being IST, hits the adjacent stack frame.

This has been broken forever, but c/s 6001660473 "x86/shstk: Rework the stack
layout to support shadow stacks" repositioned the #DF stack to be adjacent to
the guard page, which turns this OoB write into a fatal pagefault:

  (XEN) *** DOUBLE FAULT ***
  (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:  C   ]----
  (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:  C   ]----
  (XEN) CPU:    4
  (XEN) RIP:    e008:[<ffff82d04031fd4f>] traps.c#read_registers+0x29/0xc1
  (XEN) RFLAGS: 0000000000050086   CONTEXT: hypervisor (d1v0)
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d04031fd4f>] R traps.c#read_registers+0x29/0xc1
  (XEN)    [<ffff82d0403207b3>] F do_double_fault+0x3d/0x7e
  (XEN)    [<ffff82d04039acd7>] F double_fault+0x107/0x110
  (XEN)
  (XEN) Pagetable walk from ffff830236f6d008:
  (XEN)  L4[0x106] = 80000000bfa9b063 ffffffffffffffff
  (XEN)  L3[0x008] = 0000000236ffd063 ffffffffffffffff
  (XEN)  L2[0x1b7] = 0000000236ffc063 ffffffffffffffff
  (XEN)  L1[0x16d] = 8000000236f6d161 ffffffffffffffff
  (XEN)
  (XEN) ****************************************
  (XEN) Panic on CPU 4:
  (XEN) FATAL PAGE FAULT
  (XEN) [error_code=0003]
  (XEN) Faulting linear address: ffff830236f6d008
  (XEN) ****************************************
  (XEN)

and rendering the main #DF analysis broken.

The proper fix is to delete cpu_user_regs.es and later, so no
interrupt/exception path can access OoB, but this needs disentangling from the
PV ABI first.

Not-really-fixes: 6001660473 ("x86/shstk: Rework the stack layout to support shadow stacks")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6065a05adf152a556fb9f11a5218c89e41b62893
master date: 2020-10-16 11:55:33 +0100

x86/mwait-idle: customize IceLake server support

On ICX platform, the C1E auto-promotion is enabled by default.
As a result, the CPU might fall into C1E more offen than previous
platforms. So disable C1E auto-promotion and expose C1E as a separate
idle state.

Beside C1 and C1E, the exit latency of C6 was measured
by a dedicated tool. However the exit latency(41us) exposed
by _CST is much smaller than the one we measured(128us). This
is probably due to the _CST uses the exit latency when woken
up from PC0+C6, rather than PC6+C6 when C6 was measured. Choose
the latter as we need the longest latency in theory.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[Linux commit a472ad2bcea479ba068880125d7273fc95c14b70]
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 44ac57af81ff8097e228895738b911ca819bda19
master date: 2020-10-15 12:29:11 +0200

x86: fix resource leaks on arch_vcpu_create() error path

{hvm,pv}_vcpu_initialise() have always kind of been meant to be the
final possible source of errors in arch_vcpu_create(), hence not
requiring any unrolling of what they've done on the error path. (Of
course this may change once the various involved paths all have become
idempotent.)

But even beyond this aspect I think it is more logical to do policy
initialization ahead of the calling of these two functions, as they may
in principle want to access it.

Fixes: 4187f79dc718 ("x86/msr: introduce struct msr_vcpu_policy")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6a34e67c118408ebdd62bfa7be76598ca040f170
master date: 2020-10-14 14:03:38 +0200

x86/vLAPIC: don't leak regs page from vlapic_init() upon error

Fixes: 8a981e0bf25e ("Make map_domain_page_global fail")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8a62dee9ceff3056c7e0bd9632bac39bee2a51b3
master date: 2020-10-09 17:20:11 +0100

x86/S3: Restore CR4 earlier during resume

c/s 4304ff420e5 "x86/S3: Drop {save,restore}_rest_processor_state()
completely" moved CR4 restoration up into C, to account for the fact that MCE
was explicitly handled later.

However, time_resume() ends up making an EFI Runtime Service call, and EFI
explodes without OSFXSR, presumably when trying to spill %xmm registers onto
the stack.

Given this codepath, and the potential for other issues of a similar kind (TLB
flushing vs INVPCID, HVM logic vs VMXE, etc), restore CR4 in asm before
entering C.

Ignore the previous MCE special case, because its not actually necessary. The
handler is already suitably configured from before suspend.

Fixes: 4304ff420e5 ("x86/S3: Drop {save,restore}_rest_processor_state() completely")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: 7f66c0dc41ae5f770c614e516810eb1f336e2470
master date: 2020-10-06 12:28:37 +0100

xen/domain: check IOMMU options doesn't contain unknown bits set

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 59b27f360e3d9dc0378c1288e67a91fa41a77158
master date: 2020-10-02 08:38:50 +0200

evtchn/fifo: use stable fields when recording "last queue" information

Both evtchn->priority and evtchn->notify_vcpu_id could change behind the
back of evtchn_fifo_set_pending(), as for it - in the case of
interdomain channels - only the remote side's per-channel lock is held.
Neither the queue's priority nor the vCPU's vcpu_id fields have similar
properties, so they seem better suited for the purpose. In particular
they reflect the respective evtchn fields' values at the time they were
used to determine queue and vCPU.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 6f6f07b64cbe90e54f8e62b4d6f2404cf5306536
master date: 2020-10-02 08:37:35 +0200

x86/S3: fix shadow stack resume path

Fix the resume path to load the shadow stack pointer from saved_ssp (not
saved_rsp), to match what suspend path does.

Fixes: 633ecc4a7cb2 ("x86/S3: Save and restore Shadow Stack configuration")
Backport: 4.14
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4bdbf746ac9152e70f264f87db4472707da805ce
master date: 2020-09-28 10:43:10 +0200

x86/pv: Don't deliver #GP for a SYSENTER with NT set

It is a matter of guest kernel policy what to do with offending userspace, and
terminating said userspace may not be the action chosen.

Linux explicitly tolerates this case.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: fdac951560 ("x86: clear EFLAGS.NT in SYSENTER entry path")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d4a04349895edc5a5868274b906ba61ef24f47
master date: 2020-09-24 21:02:35 +0100

x86/pv: Don't clobber NT on return-to-guest

A 64bit IRET can restore NT - the faulting case is when NT is set in the live
flags. This change had an unintended consequence of causing the NT flag to
spontaneously disappear from guest context whenever a interrupt/exception
occurred.

In combination with a SYSENTER which sets both TF and NT, Xen's handling of
the #DB exceptions clears NT before it is even recorded suitably in the guest
kernel's view of what userspace was doing.

Reported-by: Andy Lutomirski <luto@kernel.org>
Fixes: 0e47f92b0 ("x86: force EFLAGS.IF on when exiting to PV guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5bcac985498ed83d89666959175ca9c9ed561ae1
master date: 2020-09-24 21:02:35 +0100

AMD/IOMMU: ensure suitable ordering of DTE modifications

DMA and interrupt translation should be enabled only after other
applicable DTE fields have been written. Similarly when disabling
translation or when moving a device between domains, translation should
first be disabled, before other entry fields get modified. Note however
that the "moving" aspect doesn't apply to the interrupt remapping side,
as domain specifics are maintained in the IRTEs here, not the DTE. We
also never disable interrupt remapping once it got enabled for a device
(the respective argument passed is always the immutable iommu_intremap).

This is part of XSA-347.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 0514a3a25fb9ebff5d75cc8f00a9229385300858
master date: 2020-10-20 14:23:12 +0200

AMD/IOMMU: update live PTEs atomically

Updating a live PTE bitfield by bitfield risks the compiler re-ordering
the individual updates as well as splitting individual updates into
multiple memory writes. Construct the new entry fully in a local
variable, do the check to determine the flushing needs on the thus
established new entry, and then write the new entry by a single insn.

Similarly using memset() to clear a PTE is unsafe, as the order of
writes the function does is, at least in principle, undefined.

This is part of XSA-347.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 3b055121c5410e2c3105d6d06aa24ca0d58868cd
master date: 2020-10-20 14:22:52 +0200

AMD/IOMMU: convert amd_iommu_pte from struct to union

This is to add a "raw" counterpart to the bitfield equivalent. Take the
opportunity and
- convert fields to bool / unsigned int,
- drop the naming of the reserved field,
- shorten the names of the ignored ones.

This is part of XSA-347.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 73f62c7380edf07469581a3049aba98abd63b275
master date: 2020-10-20 14:22:26 +0200

IOMMU: hold page ref until after deferred TLB flush

When moving around a page via XENMAPSPACE_gmfn_range, deferring the TLB
flush for the "from" GFN range requires that the page remains allocated
to the guest until the TLB flush has actually occurred. Otherwise a
parallel hypercall to remove the page would only flush the TLB for the
GFN it has been moved to, but not the one is was mapped at originally.

This is part of XSA-346.

Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ")
Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 5777a3742d88ff1c0ebc626ceb4fd47f9b3dc6d5
master date: 2020-10-20 14:21:32 +0200

IOMMU: suppress "iommu_dont_flush_iotlb" when about to free a page

Deferring flushes to a single, wide range one - as is done when
handling XENMAPSPACE_gmfn_range - is okay only as long as
pages don't get freed ahead of the eventual flush. While the only
function setting the flag (xenmem_add_to_physmap()) suggests by its name
that it's only mapping new entries, in reality the way
xenmem_add_to_physmap_one() works means an unmap would happen not only
for the page being moved (but not freed) but, if the destination GFN is
populated, also for the page being displaced from that GFN. Collapsing
the two flushes for this GFN into just one (end even more so deferring
it to a batched invocation) is not correct.

This is part of XSA-346.

Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: dea460d86957bf1425a8a1572626099ac3f165a8
master date: 2020-10-20 14:21:09 +0200

x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, get a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B Frees L2
* A writes L2[M] #

This is race exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: If you call virt_xen_to_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; same
with virt_xen_to_l1 on a virtual address within either an L3 or L2
superpage.

So take the following example:

* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, get a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointre to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe.  Thinking carefully about that is probably
something that wants to be done on public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1ce75e99d75907aaffae05fcf658a833802bce49
master date: 2020-10-20 14:20:19 +0200

x86/mm: Refactor modify_xen_mappings to have one exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b733f8a8b8db83f2d438cab3adb38b387cecfce0
master date: 2020-10-20 14:19:55 +0200